A Phrase-Based Statistical Model for SMS Text Normalization

Authors

  • AiTi Aw
  • Min Zhang
  • Juan Xiao
  • Jian Su
Abstract

Short Messaging Service (SMS) texts behave quite differently from normal written texts and exhibit some distinctive phenomena. To translate SMS texts, traditional approaches model such irregularities directly in Machine Translation (MT). However, such approaches suffer from a customization problem, as tremendous effort is required to adapt the language model of an existing translation system to handle the SMS text style. We offer an alternative approach that resolves such irregularities by normalizing SMS texts before MT. In this paper, we view the task of SMS normalization as a translation problem from the SMS language to the English language, and we propose to adapt a phrase-based statistical MT model for the task. Evaluation by 5-fold cross validation on a parallel SMS normalized corpus of 5000 sentences shows that our method can achieve a BLEU score of 0.80702 against a baseline BLEU score of 0.6958. Another experiment, translating SMS texts from English to Chinese on a separate SMS text corpus, shows that using SMS normalization as MT preprocessing can substantially boost SMS translation performance, from 0.1926 to 0.3770 in BLEU score.
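To make the idea of treating normalization as phrase-based translation concrete, here is a minimal Python sketch. It is not the authors' model: the phrase table entries, the probabilities, the `UNKNOWN_PROB` pass-through score, and the `normalize` function are all hypothetical, and the sketch assumes monotone (no reordering) decoding without a language model. It only shows how a dynamic program can pick the best segmentation of an SMS message into phrases and replace each with its normalized form.

```python
import math

# Hypothetical toy phrase table: SMS phrase -> (English phrase, probability).
# These entries and numbers are invented for illustration only.
PHRASE_TABLE = {
    ("c", "u"): ("see you", 0.9),
    ("c",): ("see", 0.6),
    ("u",): ("you", 0.8),
    ("2moro",): ("tomorrow", 0.95),
    ("b4",): ("before", 0.9),
    ("gr8",): ("great", 0.9),
}
UNKNOWN_PROB = 0.5   # assumed score for copying an unseen token unchanged
MAX_PHRASE_LEN = 3   # longest source phrase considered


def normalize(sms: str) -> str:
    """Monotone phrase-based normalization by dynamic programming."""
    tokens = sms.lower().split()  # note: input is lowercased
    n = len(tokens)
    # best[i] = (log-probability, output phrases) for the prefix tokens[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        score, out = best[i]
        if score == -math.inf:
            continue
        for j in range(i + 1, min(n, i + MAX_PHRASE_LEN) + 1):
            src = tuple(tokens[i:j])
            if src in PHRASE_TABLE:
                tgt, p = PHRASE_TABLE[src]
            elif j == i + 1:
                # Unknown single token: pass it through with a penalty.
                tgt, p = tokens[i], UNKNOWN_PROB
            else:
                continue
            cand = score + math.log(p)
            if cand > best[j][0]:
                best[j] = (cand, out + [tgt])
    return " ".join(best[n][1])


print(normalize("c u 2moro"))
print(normalize("gr8 party b4 exam"))
```

In this toy setting the two-token phrase "c u" outscores translating "c" and "u" separately, which is the kind of phrase-level choice a word-by-word dictionary lookup cannot make.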

Similar resources

Rewriting the orthography of SMS messages

Electronic written texts used in computer-mediated interactions (emails, blogs, chats, and the like) contain significant deviations from the norm of the language. This paper presents the detail of a system aiming at normalizing the orthography of French SMS messages: after discussing the linguistic peculiarities of these messages and possible approaches to their automatic normalization, we pres...

CS224N: Investigating SMS Text Normalization using Statistical Machine Translation

In this project we explore two approaches to SMS text normalization. First we try a dictionary substitution approach used by most websites that provide such a service, and then modify it with our extension. This is followed by a statistical machine translation (MT) approach using off the shelf MT tools. We evaluate the performance of our system on three test sets from different sources and disc...

A REVIEW PAPER ON SMS TEXT TO PLAIN ENGLISH TRANSLATION (Text Normalization)

Mobile technology and social networking technology play an important role in communication across the internet. A large amount of information is found in noisy contexts, as texting and chat lingo have increased considerably in the past decade. This noisy information needs to be normalized into standard text so that it can be used by various other tools such as text-to-speec...

A Query-Based SMS Translation in Information Access System

Mobile technology has contributed to the evolution of several media of communication such as chats, emails and short message service (SMS) text. This has significantly influenced the traditional standard way of expressing views from letter writing to a high-tech form of expression known as texting language. In this paper we investigated building a mobile information access system...

A Framework for Translating SMS Messages

Short Messaging Service (SMS) has become a popular form of communication. While it is predominantly used for monolingual communication, it can be extremely useful for facilitating cross-lingual communication through statistical machine translation. In this work we present an application of statistical machine translation to SMS messages. We decouple the SMS translation task into normalization f...

Publication date: 2006